Exploring the GLOTREC catalogue¶

Gimena del Rio & Romina De León¶

(HDLAB CONICET)¶

Notebook designed and maintained by Romina De León¶

Goals:¶

  • Download data from the GLOTREC repository
  • Standardize and export the dataset for exploration and analysis
  • Clean and prepare GLOTREC data related to Argentine textbooks
  • Explore the data and the relationships between:
    • Authors and Publishers
    • Publishers and School Subjects
    • Publishers, Authors, and School Subjects
  • Build visualizations similar to those currently available in GLOTREC, refined to focus on specific periods

Libraries to use¶

Description:

  • pandas: for data cleaning, manipulation, and tabular representation
  • numpy: for efficient numerical and array operations
  • matplotlib and seaborn: for statistical and exploratory data visualization
  • re: for applying regular expressions in text normalization
  • openpyxl: for reading and exporting Excel files
  • networkx: for analyzing and visualizing relationships through network graphs
  • unidecode: for removing diacritics and standardizing text encodings
  • math: for the mathematical functions defined by the C standard
  • plotly: for interactive Sankey and parallel-categories diagrams

Installation of packages if not already installed¶

In [167]:
# Only needed once to install packages
%pip install -q pandas numpy matplotlib seaborn unidecode openpyxl squarify networkx pyvis plotly
Note: you may need to restart the kernel to use updated packages.

Import necessary libraries¶

In [168]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import openpyxl
import networkx as nx
from unidecode import unidecode
import plotly.express as px
from pyvis.network import Network
import itertools
import math
import plotly.graph_objects as go

Setup visualization aesthetics for plots¶

In [169]:
sns.set_theme(style="whitegrid", palette="mako")
plt.rcParams.update({
    "figure.facecolor": "white",
    "axes.titlesize": 12,
    "axes.labelsize": 10,
    "xtick.labelsize": 8,
    "ytick.labelsize": 8
})

Read the downloaded Excel file and display a random sample of rows¶

In [170]:
df = pd.read_excel(
    "data/itbc_export_2025.xlsx",
    usecols=lambda c: not c.startswith("Unnamed"),
    dtype={"Year": "float"}
)
print(df.info())
display(df.sample(5))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  335 non-null    object 
 1   Call Number         332 non-null    object 
 2   GLOTREC|Cat Link    335 non-null    object 
 3   Catalogue           335 non-null    object 
 4   Library Catalogue   335 non-null    object 
 5   Year                335 non-null    float64
 6   Publisher           335 non-null    object 
 7   Place               335 non-null    object 
 8   Title               335 non-null    object 
 9   Authors             321 non-null    object 
 10  Pages               333 non-null    object 
 11  Format              335 non-null    object 
 12  School Subject      335 non-null    object 
 13  Level of Education  335 non-null    object 
 14  Document Type       335 non-null    object 
 15  Country of Use      335 non-null    object 
dtypes: float64(1), object(15)
memory usage: 42.0+ KB
None
| | ID | Call Number | GLOTREC\|Cat Link | Catalogue | Library Catalogue | Year | Publisher | Place | Title | Authors | Pages | Format | School Subject | Level of Education | Document Type | Country of Use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 169 | 654887500 | RA G-5(1,68) | gei654887500 | GEI | PPN=654887500 | 1968.0 | Kapelusz | Buenos Aires | Geografía dinámica - general, Asia y Africa A.... | Perpillou, Aime \| Pernet, L. \| Rampa, Alfredo C. | [VI], 328 S. Ill., graph. Darst., Kt. | Book | Geography | ISCED 2 - Lower secondary level | Textbook | Argentina |
| 264 | 655829725 | RA S-18(14,72)1 | gei655829725 | GEI | PPN=655829725 | 1972.0 | Losada | Buenos Aires | Educación democrática 1er año, [Schülerbd.] Jo... | Delfino, Jorge Raúl | 143 S. | Book | Social studies/politics | ISCED 2 - Lower secondary level | Textbook | Argentina |
| 153 | 654844038 | RA RA-45(7,56) | gei654844038 | GEI | PPN=654844038 | 1956.0 | Lasserre | Buenos Aires | Colegial primer libro de lectura corriente Áng... | Raggi, Ángela E. | [X], 104 S. Ill., Kt. | Book | Mother tongue - Readers | ISCED 1 - Primary level | Reading Book | Argentina |
| 13 | 1027054358 | RA S-43(3,2005)8 | gei1027054358 | GEI | PPN=1027054358 | 2005.0 | A-Z editora | Ciudad Autónoma de Buenos Aires | Sociedad en red EGB 3o ciclo 8, [Schülerband] ... | Bruno, Paula \| Calvo, Graciela \| Castellani, A... | 285 Seiten Illustrationen, Diagramme, Karten | Book | Social studies/politics \| History \| Geography | ISCED 2 - Lower secondary level | Textbook | Argentina |
| 34 | 1027341152 | RA S-49(1,2013)2 | gei1027341152 | GEI | PPN=1027341152 | 2013.0 | A-Z editora | Ciudad Autónoma de Buenos Aires | Educación cívica 2, [Schülerband] Poder, estad... | Fraga, Norberto E. \| Ribas, Gabriel A. | 157 Seiten Illustrationen | Book | Social studies/politics | ISCED 2 - Lower secondary level | Textbook | Argentina |

Normalization of the Publisher column¶

  • Strip surrounding whitespace and convert to lowercase
  • Map publisher name variants to a canonical name with a dictionary
  • Apply the normalization to the Publisher column
In [171]:
df['Publisher'] = df['Publisher'].str.strip().str.lower()
 
mapa_editoriales = {
    'a-z editora': 'A-Z Editora',
    'a-z ed.': 'A-Z Editora',
    'az editora': 'A-Z Editora',
    'estrada': 'Estrada',
    'estrada secundaria': 'Estrada',
    'angel estrada & cía.s.a.-editores': 'Estrada',
    'puerto de palos s.a. casa de édiciones': 'Puerto de Palos',
    'puerto de palos': 'Puerto de Palos',
    'aique primaria': 'Aique',
    'aique secundaria': 'Aique',
    'aique': 'Aique',
    'kapelusz': 'Kapelusz',
    'ed. kapelusz': 'Kapelusz',
    'kapelusz norma': 'Kapelusz',
    'tinta fresca': 'Tinta Fresca',
    'doce orcas ediciones': 'Doce Orcas',
    'doce orcas ed.': 'Doce Orcas',
    'doce orcas': 'Doce Orcas',
    'ed. stella': 'Stella',
    'ed. atlántida': 'Atlántida',
    'losada': 'Losada',
    'ed. troquel': 'Troquel',
    'imprenta mercur': 'Mercur',
    'imprenta de pablo e. coni, especial para obras': 'Coni',
    'coni': 'Coni',
    'igon': 'Igon',
    'igón': 'Igon',
    'goethe-inst.': 'Goethe-Institut',
    'cesarini': 'Cesarini',
    'cesarini hnos. ed.': 'Cesarini',
    'producciones mawis': 'Mawis',
    'editorial h.m.e.': 'HME',
    'imprenta y librería de mayo': 'Librería de Mayo',
    'librería del colegio, alsina y bolívar': 'Librería del Colegio',
    'cabaut, librería del colegio': 'Librería del Colegio',
    'alsina & bolívar, librería del colegio': 'Librería del Colegio',
    'librería del colegio': 'Librería del Colegio',
    'ed. crespillo': 'Crespillo',
    'f. crespillo': 'Crespillo',
    'f. crespillo editor': 'Crespillo',
    'ed. peuser': 'Peuser',
    'peuser': 'Peuser'
    }

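# Map known variants to their canonical name and title-case only the unmapped ones,
# so canonical spellings such as 'HME' or 'Librería de Mayo' are not altered by .title()
df['Publisher'] = df['Publisher'].map(lambda p: mapa_editoriales.get(p, p.title()))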
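
As a minimal sketch of the mapping step above, the following self-contained example uses a three-entry excerpt of `mapa_editoriales` and made-up input strings. The names `publisher_map` and `normalize_publisher` are hypothetical and not part of the notebook. The helper returns the canonical name when a variant is known and falls back to title case otherwise, which also makes it easy to list the variants that still lack a mapping.

```python
import pandas as pd

# Excerpt of the mapping used above; the keys are lower-cased variants.
publisher_map = {
    "a-z ed.": "A-Z Editora",
    "ed. kapelusz": "Kapelusz",
    "editorial h.m.e.": "HME",
}

def normalize_publisher(raw):
    """Return the canonical publisher name, or a title-cased fallback."""
    if not isinstance(raw, str):
        return raw  # leave missing values untouched
    key = raw.strip().lower()
    return publisher_map.get(key, key.title())

# Made-up sample values, for illustration only.
sample = pd.Series(["  A-Z Ed. ", "ED. KAPELUSZ", "Editorial H.M.E.", "example press"])
normalized = sample.map(normalize_publisher)

# Variants that are not yet covered by the dictionary.
unmapped = sorted({s.strip().lower() for s in sample} - set(publisher_map))

print(normalized.tolist())
print(unmapped)
```

Reviewing the `unmapped` list after each new export is one way to decide which entries the dictionary still needs.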

Create a function to normalize author names according to specified rules¶

1. Remove accents and extra spaces
2. If there's a comma, we assume "Last, First" format
3. If last name has multiple parts, keep them together
4. Select only the first given name
5. Rebuild normalized name
6. If no comma, just title case the whole name

Apply the function to the Authors column

In [172]:
def normalizar_autor(nombre):
    if not isinstance(nombre, str) or not nombre.strip():
        return None

    # remove accents and extra spaces
    nombre = unidecode(nombre.strip())

    # If there's a comma, we assume "Last, First" format
    if ',' in nombre:
        apellido, resto = nombre.split(',', 1)
        apellido = apellido.strip()

        # if last name has multiple parts, keep them together
        apellido = re.sub(r'\s+', ' ', apellido)

        # select only the first given name
        resto = resto.strip()
        primer_nombre = resto.split()[0] if resto else ''

        # rebuild normalized name
        nombre_norm = f"{apellido.title()}, {primer_nombre.title()}"
    else:
        # if no comma, just title case the whole name
        nombre_norm = nombre.title()

    return nombre_norm.strip()

# Split the pipe-separated authors and normalize each name, skipping blank entries
df['Authors'] = df['Authors'].fillna('').str.split('|').apply(lambda lst: [normalizar_autor(a) for a in lst if a.strip()])
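
The rules listed above can be sketched on a single pipe-separated string as follows. This is an illustration under stated assumptions, not the notebook's function: it uses `unicodedata` from the standard library as a stand-in for `unidecode` so that it runs without extra packages (the two can differ for characters that have no decomposition), the names `strip_accents` and `normalize_author` are hypothetical, and the third sample name is made up.

```python
import re
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_author(name):
    """Apply the normalization rules to a single author string."""
    if not isinstance(name, str) or not name.strip():
        return None
    name = strip_accents(name.strip())
    if "," in name:
        last, rest = name.split(",", 1)
        last = re.sub(r"\s+", " ", last.strip())  # keep compound surnames together
        given = rest.split()
        first = given[0] if given else ""         # only the first given name
        return f"{last.title()}, {first.title()}".strip()
    return name.title()                           # no comma: title-case the whole name

raw = "Raggi, Ángela E. | Delfino, Jorge Raúl | del  Rio,  maría"
authors = [normalize_author(a) for a in raw.split("|") if a.strip()]
print(authors)
```

Note that `str.title()` also capitalizes particles such as "del", so compound surnames come out as "Del Rio" under these rules.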

Graph Section¶

Graph Publishers by Number of Books¶

In [173]:
top = df['Publisher'].value_counts().head(25)
plt.figure(figsize=(14, 8))
sns.barplot(y=top.index, x=top.values, hue=top.index, palette="Set3", legend=False)
plt.xlabel('Books Count')
plt.ylabel('Publisher')
plt.title('Publishers by Number of Books')
plt.tight_layout()
plt.show()
[Figure: bar chart, Publishers by Number of Books]

Graph Publishers by Number of Books and Level of Education¶

In [174]:
sns.set(style="whitegrid", rc={
    "axes.facecolor": "white",
    "figure.facecolor": "white",
    "axes.edgecolor": "lightgray",
    "grid.color": "lightgray"
})

# temporal dataframe with counts
df_temp = (
    df.groupby(['Level of Education', 'Document Type', 'Publisher'])
      .size()
      .reset_index(name='Books Count')
)
 
df_temp = df_temp[df_temp['Books Count'] > 3]  # keep publisher groups with more than 3 books

# create facet grid
g = sns.FacetGrid(
    df_temp,
    col='Level of Education',
    col_wrap=2,
    sharex=False,
    sharey=False,
    height=4.5,
    aspect=1.3,
    margin_titles=True,
    despine=False,
)

# facet_grid barplot
g.map_dataframe(
    sns.barplot,
    x='Books Count',
    y='Publisher',
    hue='Publisher',
    palette='Set3',
    legend=False,
    dodge=False,    
    edgecolor='gray',
    errorbar=None
)

g.set_titles(col_template="{col_name}", fontsize=11, fontweight='bold', pad=10)

# title styling and adjustments
for ax in g.axes.flat:
    if ax.get_legend():
        ax.legend_.remove()

    ax.tick_params(axis='y', labelsize=9)
    ax.tick_params(axis='x', labelsize=8)
    ax.grid(True, axis='x', linestyle=':', linewidth=0.5)
    
# remove duplicate legends and add a single legend at the bottom
handles, labels = g.axes.flat[0].get_legend_handles_labels()
if handles:
    g.figure.legend(handles, labels, loc='lower center', ncol=4, fontsize=9, frameon=False)

# global labels
g.set_axis_labels('Books Count', 'Publisher') 
plt.subplots_adjust(hspace=0.4, wspace=0.4, bottom=0.25)
plt.show()
[Figure: faceted bar charts of Books Count per Publisher, one panel per Level of Education]

Heatmap of Publishers vs School Subjects¶

In [175]:
plt.figure(figsize=(18, 12))
(df
 .explode('Publisher')
 .query("`School Subject` != 'German taught in non-German-speaking countries'")
 .pipe(lambda d: sns.heatmap(
    pd.crosstab(d['School Subject'], d['Publisher']),
    #cmap='PiYG',
    cmap='YlGnBu',
    annot=False,
    fmt='d',
    annot_kws={"size": 10},
    linewidth=0.5,
    linecolor='white',
    cbar_kws={'label': 'Books Count'},
    mask=(pd.crosstab(d['School Subject'], d['Publisher']) <= 3)
)))
plt.title('Count of Books by Publisher and Subject', fontsize=18, pad=20, weight='bold')
plt.xlabel('Publisher', fontsize=14, labelpad=12)
plt.ylabel('School Subject', fontsize=14, labelpad=12)
plt.xticks(rotation=60, ha='right', fontsize=12)
plt.yticks(rotation=0, fontsize=12)
sns.despine(left=True, bottom=True)
plt.tight_layout()
plt.show()
[Figure: heatmap, Count of Books by Publisher and Subject]
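
The heatmap cell above computes the same crosstab twice, once for the data and once for the mask. A minimal sketch of that pattern, on made-up records, builds the table once and derives the mask from it. The names `books`, `ctab`, and `mask` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up records, for illustration only.
books = pd.DataFrame({
    "School Subject": ["History"] * 5 + ["Geography"] * 2 + ["History"],
    "Publisher": ["Kapelusz"] * 5 + ["Kapelusz"] * 2 + ["Losada"],
})

# Build the contingency table once and reuse it for both data and mask.
ctab = pd.crosstab(books["School Subject"], books["Publisher"])
mask = ctab <= 3  # True marks the cells the heatmap would hide

print(ctab)
print(mask)
# sns.heatmap(ctab, mask=mask, cmap="YlGnBu") would then draw only the unmasked cells.
```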
In [176]:
# Histogram of books by Year and Publisher (top 25 publishers)

# One row per book-author pair, so a multi-author book is counted once per author
df_exploded = df.explode('Authors')

plt.figure(figsize=(14, 8))

colors_palette = sns.color_palette("Paired", 12) + sns.color_palette("Dark2", 13)

top_publishers = df_exploded['Publisher'].value_counts().head(25).index

ax = sns.histplot(
    df_exploded[df_exploded['Publisher'].isin(top_publishers)],
    x='Year',
    hue='Publisher',
    multiple='stack',
    palette=colors_palette,
    bins=40,
    legend=True
)

# Move the legend outside the axes using the public seaborn API
if ax.get_legend():
    sns.move_legend(ax, "upper left", bbox_to_anchor=(1.05, 1),
                    title="Publisher", frameon=False, fontsize=9)
else:
    print("⚠️ Legend not found.")

# No year filter is applied, so the title covers the whole period
plt.title('Distribution of Books by Year and Publisher')
plt.xlabel('Year of Publication')
plt.ylabel('Book Count')
sns.despine()
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
[Figure: stacked histogram, Distribution of Books by Year and Publisher]
In [178]:
# Heatmap of Publishers vs Level of Education
df['Level of Education'] = df['Level of Education'].fillna('').str.split('|').str[0]
plt.figure(figsize=(20, 12))
(df
 .explode('Publisher')
 #.query("`School Subject` != 'German taught in non-German-speaking countries'")
 .pipe(lambda d: pd.crosstab(d['Level of Education'], d['Publisher']))
 .pipe(lambda ctab: ctab[ctab >= 4].dropna(how='all').dropna(axis=1, how='all'))  
 .pipe(lambda filtered: sns.heatmap(
     filtered,
    #cmap='PiYG',
    cmap='YlGnBu',
    annot=False,
    fmt='d',
    annot_kws={"size": 8},
    linewidth=0.5,
    linecolor='white',
    cbar_kws={'label': 'Books Count'})
))
plt.title('Count of Books by Publisher and Level of Education', fontsize=18, pad=20, weight='bold')
plt.xlabel('Publisher', fontsize=13, labelpad=12)
plt.ylabel('Level of Education', fontsize=13, labelpad=12)
plt.xticks(rotation=60, ha='right', fontsize=12)
plt.yticks(rotation=0, fontsize=12)
sns.despine(left=True, bottom=True)
plt.tight_layout()
plt.show()
[Figure: heatmap, Count of Books by Publisher and Level of Education]
In [179]:
# Heatmap of Publishers vs Document Type
df['Document Type'] = df['Document Type'].fillna('').str.split('|').str[0]
plt.figure(figsize=(16, 10))
(df
 .explode('Publisher')
 #.query("`School Subject` != 'German taught in non-German-speaking countries'")
 .pipe(lambda d: sns.heatmap(
    pd.crosstab(d['Document Type'], d['Publisher']),
    #cmap='PiYG',
    cmap='YlGnBu',
    annot=False,
    fmt='d',
    linecolor='gray',
    cbar_kws={'label': 'Books Count'},
    mask=(pd.crosstab(d['Document Type'], d['Publisher']) < 4)
)))
plt.title('Count of Books by Publisher and Document Type', fontsize=18, pad=20, weight='bold')
plt.xlabel('Publisher', fontsize=13, labelpad=10)
plt.ylabel('Document Type', fontsize=13, labelpad=10)
plt.xticks(rotation=60, ha='right', fontsize=9)
plt.yticks(rotation=0, fontsize=10)
sns.despine(left=True, bottom=True)
plt.tight_layout()
plt.show()
[Figure: heatmap, Count of Books by Publisher and Document Type]

Sankey plot for Publisher-Author collaborations¶

Overview of collaborations between publishers and authors in the catalogue

In [180]:
# Count summary
df_flow = (df.explode('Authors')
             .groupby(['Publisher', 'Authors'])
             .size()
             .reset_index(name='count'))
# Optional filters: leave the lists empty to include every publisher and author
include_publishers = []  
include_authors = []    

if include_publishers:
    df_flow = df_flow[df_flow['Publisher'].isin(include_publishers)]

if include_authors:
    df_flow = df_flow[df_flow['Authors'].isin(include_authors)]

df_flow = df_flow[df_flow['count'] >= 2]   # keep publisher-author pairs with at least 2 publications

# create nodes and links
publishers = df_flow['Publisher'].dropna().unique().tolist()
authors = df_flow['Authors'].dropna().unique().tolist()
all_nodes = publishers + authors

source = df_flow['Publisher'].apply(lambda x: all_nodes.index(x))
target = df_flow['Authors'].apply(lambda x: all_nodes.index(x))
value = df_flow['count']

# colors for each group
publisher_color = "#4C72B0"   
author_color = "#9909A9"     
colors = [publisher_color] * len(publishers) + [author_color] * len(authors)

# create Sankey 
fig = go.Figure(go.Sankey(
    node=dict(
        label=all_nodes,
        pad=15,
        thickness=15,
        color=colors,  
        line=dict(color="white", width=0.5)
    ),
    link=dict(
        source=source,
        target=target,
        value=value,
        color="rgba(150,150,150,0.3)"
    )
))

fig.update_layout(
    title_text=f"Flow of Publications between Publishers and Authors<br><sup>{len(publishers)} publishers — {len(authors)} authors</sup>",
    font_size=10,
    height=700
)
fig.show(renderer='notebook')
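
Plotly's Sankey trace expects `link.source` and `link.target` as integer positions into the node label list. The cell above finds those positions with `list.index`. As a hedged alternative sketch on made-up counts, a dictionary lookup avoids the repeated linear scans, and keying the nodes by role keeps a publisher and an author with the same name apart. `flows`, `node_keys`, and `node_index` are illustrative names.

```python
import pandas as pd

# Made-up publisher-author counts, for illustration only.
flows = pd.DataFrame({
    "Publisher": ["Kapelusz", "Kapelusz", "Losada"],
    "Authors": ["Author A", "Author B", "Author A"],
    "count": [3, 2, 2],
})

publishers = flows["Publisher"].unique().tolist()
authors = flows["Authors"].unique().tolist()

# Prefix node keys with their role so identical names in both groups
# still get distinct nodes; the display labels stay unprefixed.
node_keys = [("publisher", p) for p in publishers] + [("author", a) for a in authors]
node_index = {key: i for i, key in enumerate(node_keys)}
labels = [name for _, name in node_keys]

source = [node_index[("publisher", p)] for p in flows["Publisher"]]
target = [node_index[("author", a)] for a in flows["Authors"]]
value = flows["count"].tolist()

print(labels, source, target, value)
```

The resulting `labels`, `source`, `target`, and `value` lists could be passed to `go.Sankey` in the same way as in the cell above.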

Alluvial plots of Publishers vs Authors by period¶

In [181]:
fig = px.parallel_categories(
    df.explode('Authors')
      .query('(Year >= 1860) & (Year < 1900)')
      .groupby(['Publisher', 'Authors'])
      .size()
      .reset_index(name='count'),
    dimensions=['Publisher', 'Authors'],
    color='count',
    color_continuous_scale=px.colors.sequential.Viridis,
    labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)


fig.update_layout(
    title=dict(
        text="Publisher–Author Collaborations (1860–1900)",
        x=0.5,
        xanchor='center',
        font=dict(size=18, family='Arial Black')
    ),
    font=dict(size=11, family='Arial'),
    coloraxis_colorbar=dict(
        title="Books Count",
        tickfont=dict(size=10)
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    margin=dict(l=60, r=60, t=80, b=50),
    height=700,
    dragmode=False, 
    coloraxis_showscale=False
)

fig.show(renderer='notebook')
In [182]:
fig = px.parallel_categories(
    df.explode('Authors')
      .query('(Year >= 1900) & (Year < 1940)')
      .groupby(['Publisher', 'Authors'])
      .size()
      .reset_index(name='count'),
    dimensions=['Publisher', 'Authors'],
    color='count',
    color_continuous_scale=px.colors.sequential.Cividis,
    labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)


fig.update_layout(
    title=dict(
        text="Publisher–Author Collaborations (1900–1940)",
        x=0.5,
        xanchor='center',
        font=dict(size=18, family='Arial Black')
    ),
    font=dict(size=11, family='Arial'),
    coloraxis_colorbar=dict(
        title="Books Count",
        tickfont=dict(size=10)
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    margin=dict(l=60, r=60, t=80, b=50),
    height=700,
    dragmode=False, 
    coloraxis_showscale=False
)

fig.show(renderer='notebook')
In [183]:
fig = px.parallel_categories(
    df.explode('Authors')
      .query('(Year >= 1940) & (Year < 1980)')
      .groupby(['Publisher', 'Authors'])
      .size()
      .reset_index(name='count')
      .query('count >= 2'),
    dimensions=['Publisher', 'Authors'],
    color='count',
    color_continuous_scale=px.colors.sequential.Plasma,
    labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)


fig.update_layout(
    title=dict(
        text="Publisher–Author Collaborations (1940–1980)",
        x=0.5,
        xanchor='center',
        font=dict(size=18, family='Arial Black')
    ),
    font=dict(size=11, family='Arial'),
    coloraxis_colorbar=dict(
        title="Books Count",
        tickfont=dict(size=10)
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    margin=dict(l=60, r=60, t=80, b=50),
    height=700,
    dragmode=False, 
    coloraxis_showscale=False
)

fig.show(renderer='notebook')
In [184]:
fig = px.parallel_categories(
    df.explode('Authors')
      .query('(Year >= 1980) & (Year < 2000)')
      .groupby(['Publisher', 'Authors'])
      .size()
      .reset_index(name='count')
      .query('count >= 2'),
    dimensions=['Publisher', 'Authors'],
    color='count',
    color_continuous_scale=px.colors.sequential.Plasma,
    labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)


fig.update_layout(
    title=dict(
        text="Publisher–Author Collaborations (1980–2000)",
        x=0.5,
        xanchor='center',
        font=dict(size=18, family='Arial Black')
    ),
    font=dict(size=11, family='Arial'),
    coloraxis_colorbar=dict(
        title="Books Count",
        tickfont=dict(size=10)
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    margin=dict(l=60, r=60, t=80, b=50),
    height=700,
    dragmode=False, 
    coloraxis_showscale=False
)

fig.show(renderer='notebook')
In [185]:
fig = px.parallel_categories(
    df.explode('Authors')
      .query('(Year >= 2000) & (Year < 2010)')
      .groupby(['Publisher', 'Authors'])
      .size()
      .reset_index(name='count')
      .query('count >= 2'),
    dimensions=['Publisher', 'Authors'],
    color='count',
    color_continuous_scale=px.colors.sequential.Plasma,
    labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)


fig.update_layout(
    title=dict(
        text="Publisher–Author Collaborations (2000–2010)",
        x=0.5,
        xanchor='center',
        font=dict(size=18, family='Arial Black')
    ),
    font=dict(size=11, family='Arial'),
    coloraxis_colorbar=dict(
        title="Books Count",
        tickfont=dict(size=10)
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    margin=dict(l=60, r=60, t=80, b=50),
    height=700,
    dragmode=False, 
    coloraxis_showscale=False
)

fig.show(renderer='notebook')
In [186]:
fig = px.parallel_categories(
    df.explode('Authors')
      .query('(Year >= 2010)')
      .groupby(['Publisher', 'Authors'])
      .size()
      .reset_index(name='count')
      .query('count >= 2'),
    dimensions=['Publisher', 'Authors'],
    color='count',
    color_continuous_scale=px.colors.sequential.Plasma,
    labels={'Publisher': 'Publisher', 'Authors': 'Author', 'count': 'Books Count'},
)


fig.update_layout(
    title=dict(
        text="Publisher–Author Collaborations after 2010",
        x=0.5,
        xanchor='center',
        font=dict(size=18, family='Arial Black')
    ),
    font=dict(size=11, family='Arial'),
    coloraxis_colorbar=dict(
        title="Books Count",
        tickfont=dict(size=10)
    ),
    paper_bgcolor='white',
    plot_bgcolor='white',
    margin=dict(l=60, r=60, t=80, b=50),
    height=700,
    dragmode=False, 
    coloraxis_showscale=False
)

fig.show(renderer='notebook')
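
The period cells in this section repeat the same explode, filter, group, and count steps with different year bounds. As a hedged sketch, those steps could be wrapped in one helper. `pair_counts` and its parameters are hypothetical names introduced here, and the sample records are made up.

```python
import pandas as pd

def pair_counts(df, start=None, end=None, min_count=1):
    """Count Publisher-Author pairs for start <= Year < end (both bounds optional)."""
    rows = df.explode("Authors").dropna(subset=["Authors"])
    if start is not None:
        rows = rows[rows["Year"] >= start]
    if end is not None:
        rows = rows[rows["Year"] < end]
    counts = (
        rows.groupby(["Publisher", "Authors"])
            .size()
            .reset_index(name="count")
    )
    return counts[counts["count"] >= min_count].reset_index(drop=True)

# Made-up records, for illustration only.
books = pd.DataFrame({
    "Year": [1965.0, 1968.0, 1972.0, 2005.0],
    "Publisher": ["Kapelusz", "Kapelusz", "Losada", "Kapelusz"],
    "Authors": [["Author A", "Author B"], ["Author A"], ["Author C"], ["Author A"]],
})

result = pair_counts(books, 1940, 1980, min_count=2)
print(result)
```

The returned frame could then be passed to `px.parallel_categories` in place of the inline chain, for example `pair_counts(df, 1940, 1980, min_count=2)` for the 1940 to 1980 cell.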

Heatmap of relationship between School Subject and Authors¶

In [187]:
plt.figure(figsize=(16, 10))
# vmax = 25  # optional cap for the colour scale
(df
 .explode('Authors')
 .query("`School Subject` != 'German taught in non-German-speaking countries'")
 .pipe(lambda d: d[
        d['School Subject'].isin(d['School Subject'].value_counts().nlargest(25).index) &
        d['Authors'].isin(d['Authors'].value_counts().nlargest(25).index) 
    ])
 .pipe(lambda d: sns.heatmap(
     pd.crosstab(d['School Subject'], d['Authors']),
     cmap='cubehelix_r',
     annot=False,
     fmt='d',
     linecolor='gray',
     cbar_kws={'label': 'Books Count'},
     #vmax=vmax
 ))
)
plt.title('Relationship between School Subjects and Authors', fontsize=18, pad=20, weight='bold')
plt.xlabel('Authors', fontsize=13, labelpad=10)
plt.ylabel('School Subject', fontsize=13, labelpad=10)

# rotate tick labels for readability
plt.xticks(rotation=45, ha='right', fontsize=9)
plt.yticks(rotation=10, fontsize=10)

# remove the left and bottom spines
sns.despine(left=True, bottom=True)

# fit the layout to the figure
plt.tight_layout()
plt.show()
[Figure: heatmap, Relationship between School Subjects and Authors]

New column for decades¶

In [188]:
# column with decades of publication
df['YearInterval'] = pd.cut(df['Year'], bins=list(range(1860, 2021, 10)))
print(df['YearInterval'].value_counts().sort_index())
YearInterval
(1860, 1870]     1
(1870, 1880]     1
(1880, 1890]     1
(1890, 1900]     6
(1900, 1910]     4
(1910, 1920]     6
(1920, 1930]     4
(1930, 1940]     4
(1940, 1950]    10
(1950, 1960]    34
(1960, 1970]    63
(1970, 1980]    23
(1980, 1990]    43
(1990, 2000]    14
(2000, 2010]    49
(2010, 2020]    72
Name: count, dtype: int64
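
By default `pd.cut` builds right-closed bins, so with the edges used above a book from 1870 falls in `(1860, 1870]` rather than in the calendar decade of the 1870s, and a year equal to the lowest edge gets no bin at all. The sketch below, on made-up years, compares the default with `right=False`, which gives left-closed bins that line up with calendar decades but leave out the highest edge instead.

```python
import pandas as pd

# Made-up years, chosen to sit on and around the bin edges.
years = pd.Series([1860, 1870, 1871, 1968, 2020])
edges = list(range(1860, 2021, 10))

# Default: right-closed bins, (1860, 1870], (1870, 1880], ...
right_closed = pd.cut(years, bins=edges)

# Alternative: left-closed bins, [1860, 1870), [1870, 1880), ...
left_closed = pd.cut(years, bins=edges, right=False)

print(pd.DataFrame({
    "year": years,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

Which convention fits best depends on how the decade labels are meant to be read in the plots that follow.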

Time series of book counts by authors¶

In [189]:
# One row per book-author pair
df_exploded = df.explode("Authors")

# Count books by YearInterval, Publisher, School Subject, and Authors.
# observed=True keeps only the combinations present in the data; without it,
# the categorical YearInterval key adds zero-count rows for unused combinations.
counts = (
    df_exploded.groupby(["YearInterval", "Publisher", "School Subject", "Authors"], observed=True)
    .size()
    .reset_index(name="Count")
)
# transform YearInterval to its midpoint for plotting
counts['YearMid'] = counts['YearInterval'].apply(lambda x: x.mid if pd.notnull(x) else None)

# top authors for plotting
top_authors = (
    counts.groupby("Authors")["Count"].sum().nlargest(25).index
)
counts_top = counts[counts["Authors"].isin(top_authors)]
plt.figure(figsize=(16, 10))
sns.lineplot(
    data=counts_top,   # filter to top authors
    x="YearMid",
    y="Count",
    hue="Authors",
    marker="o",
    linewidth=2,
    palette="tab20",
)

# Labels and style
plt.title("Books Count over Time by Top Authors", pad=15, weight="bold")
plt.xlabel("Year (midpoint of interval)")
plt.ylabel("Number of Books")
plt.legend(title="Authors", bbox_to_anchor=(1.05, 1), loc="upper left", frameon=False)
sns.despine()
plt.tight_layout()
plt.show()
[Figure: line chart, Books Count over Time by Top Authors]
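
Because `YearInterval` is a categorical column, the `observed` argument of `groupby` changes what the counts contain. A small sketch on made-up records shows the difference: with `observed=False` the result covers every combination of the grouping keys, including pairs that never occur, while `observed=True` returns only the pairs present in the data.

```python
import pandas as pd

# Made-up records, for illustration only.
books = pd.DataFrame({
    "Year": [1955, 1958, 1965],
    "Publisher": ["Kapelusz", "Losada", "Kapelusz"],
})
books["YearInterval"] = pd.cut(books["Year"], bins=[1940, 1950, 1960, 1970])

keys = ["YearInterval", "Publisher"]
all_combos = books.groupby(keys, observed=False).size()  # includes zero counts
seen_only = books.groupby(keys, observed=True).size()    # observed pairs only

print(len(all_combos), len(seen_only))
```

Zero-count rows matter downstream: they pull averages toward zero in line plots and become zero-weight edges when the counts are turned into a graph.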

Network Graph¶

Network graph of publishers and decades¶

In [190]:
# Build edgelist for decades and publishers.
# observed=True keeps only the decade-publisher pairs present in the data,
# so no zero-weight edges are created for unused combinations.
edgelist = (
    df_exploded.groupby(['YearInterval', 'Publisher'], observed=True)
    .size()
    .reset_index(name='weight')
)

# build graph
G = nx.from_pandas_edgelist(
    edgelist,
    source="YearInterval",
    target="Publisher",
    edge_attr="weight"
)

# Graph visualization
plt.figure(figsize=(12,12))
pos = nx.spring_layout(G, k=1.0, iterations=50, seed=42)

# Edge weights
weights = [G[u][v]['weight']*0.5 for u,v in G.edges()]


# colors for decades vs publishers
# (decade nodes are pd.Interval objects produced by pd.cut, not numbers)
publisher_names = set(df_exploded['Publisher'].unique())
node_colors = []
for node in G.nodes():
    if isinstance(node, pd.Interval):
        node_colors.append('lightgreen')   # Decade
    elif node in publisher_names:
        node_colors.append('lightblue')    # Publisher
    else:
        node_colors.append('lightcoral')   # Other


# proportional node sizes
node_sizes = [100 + 50*G.degree(n) for n in G.nodes()]

nx.draw(
    G, pos, with_labels=True,
    node_color=node_colors,
    node_size=node_sizes,
    edge_color='gray',
    width=weights,
    font_size=9
)

plt.title("Relationship between Publishers and Decades", fontsize=14)
plt.show()
[Figure: network graph, Relationship between Publishers and Decades]

Tripartite graph: Decade ↔ Publisher ↔ Author¶

In [191]:
# prepare data for network graph
df_exploded = (
    df.explode('Authors')
      .dropna(subset=['Authors'])
      .query("Authors != ''")
)

# build edgelist for Decade ↔ Publisher
edges_decade_publisher = (
    df_exploded.groupby(['YearInterval', 'Publisher'])
    .size()
    .reset_index(name='weight')
)

# build edgelist for Publisher ↔ Author
edges_publisher_author = (
    df_exploded.groupby(['Publisher', 'Authors'])
    .size()
    .reset_index(name='weight')
)

common_publishers = set(edges_decade_publisher['Publisher']).intersection(edges_publisher_author['Publisher'])

edges_decade_publisher = edges_decade_publisher[
    edges_decade_publisher['Publisher'].isin(common_publishers)
]
edges_publisher_author = edges_publisher_author[
    edges_publisher_author['Publisher'].isin(common_publishers)
]

# graph tripartite
G = nx.Graph()

# new edges decade ↔ Publisher
for _, row in edges_decade_publisher.iterrows():
    G.add_edge(row['YearInterval'], row['Publisher'], weight=row['weight'])

# new edges Publisher ↔ Author
for _, row in edges_publisher_author.iterrows():
    G.add_edge(row['Publisher'], row['Authors'], weight=row['weight'])

# delete isolated nodes
G.remove_nodes_from(list(nx.isolates(G)))

# graph visualization
plt.figure(figsize=(20,20))
pos = nx.spring_layout(G, k=2.5, iterations=250, seed=42)

# weights for edges
pesos = [G[u][v]['weight']*0.1 for u,v in G.edges()]

# colors for node types
# (decade nodes are pd.Interval objects produced by pd.cut, not numbers)
publisher_names = set(df_exploded['Publisher'].unique())
node_colors = []
for node in G.nodes():
    if isinstance(node, pd.Interval):
        node_colors.append('lightcoral')   # Decade
    elif node in publisher_names:
        node_colors.append('lightblue')    # Publisher
    else:
        node_colors.append('lightgreen')   # Author

node_sizes = [100 + 50*G.degree(n) for n in G.nodes()]

nx.draw(
    G, pos, with_labels=True,
    node_color=node_colors,
    node_size=node_sizes,
    edge_color='gray',
    width=pesos
)

plt.title("Graph tripartite, Decade ↔ Publisher ↔ Author")
plt.show()
C:\Users\Adm\AppData\Local\Temp\ipykernel_6312\1638740445.py:10: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

[Figure: network graph, Decade ↔ Publisher ↔ Author]
In [192]:
print(len(G.nodes()), "nodes and", len(G.edges()), "edges")
391 nodes and 884 edges

Network graph of Authors and Publishers¶

In [193]:
# prepare data for network graph
df_exploded = (
    df.explode('Authors')
      .dropna(subset=['Authors'])
      .query("Authors != ''")
)
df_network = df_exploded[['Publisher', 'Authors']].dropna().copy()

# build graph bipartite Publisher ↔ Author
G = nx.from_pandas_edgelist(
    df_network, 
    source='Publisher', 
    target='Authors', 
    create_using=nx.Graph()
)

# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
degrees_sorted = sorted(node_degrees.values(), reverse=True)
# degree of the 26th-ranked node, or of the last node in smaller graphs
threshold = degrees_sorted[min(25, len(degrees_sorted) - 1)]

# keep nodes with degree >= threshold (ties at the cut-off are all kept)
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)

plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm

# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]

# draw nodes
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=publishers_nodes, 
    node_color='skyblue', 
    node_size=800, 
    label='Publishers'
)
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=authors_nodes, 
    node_color='lightcoral', 
    node_size=200, 
    label='Authors'
)

# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)

plt.title('Network of Authors and Publishers (Top 25 nodes)', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
[Figure: network graph, Authors and Publishers (Top 25 nodes)]
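
Selecting the "top" nodes by a degree threshold has two edge cases worth keeping in mind: graphs with fewer nodes than the requested rank, and ties at the cut-off. The sketch below works on a plain dictionary of made-up degrees, which with networkx would come from `dict(G.degree())`. `top_nodes_by_degree` is a hypothetical helper, not part of the notebook.

```python
def top_nodes_by_degree(degrees, k):
    """Return nodes whose degree is at least the k-th largest degree.

    Ties at the cut-off are kept, so the result can hold more than k nodes.
    """
    if not degrees or k <= 0:
        return []
    ranked = sorted(degrees.values(), reverse=True)
    threshold = ranked[min(k, len(ranked)) - 1]  # clamp k to the graph size
    return [node for node, degree in degrees.items() if degree >= threshold]

# Made-up degrees, for illustration only.
degrees = {"Kapelusz": 5, "Losada": 3, "Author A": 3, "Author B": 1}

top_two = top_nodes_by_degree(degrees, 2)
top_ten = top_nodes_by_degree(degrees, 10)
print(top_two)
print(top_ten)
```

With these values, asking for two nodes returns three because of the tie at degree 3, which is why a plot titled "Top 25 nodes" can legitimately show more than 25.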

Network graphs by time period¶

In [194]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
    df_exploded['Year'].between(1860, 1940)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()

# build graph bipartite Publisher ↔ Author
G = nx.from_pandas_edgelist(
    df_network, 
    source='Publisher', 
    target='Authors', 
    create_using=nx.Graph()
)

# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
degrees_sorted = sorted(node_degrees.values(), reverse=True)
# degree of the 26th-ranked node, or of the last node in smaller graphs
threshold = degrees_sorted[min(25, len(degrees_sorted) - 1)]

# keep nodes with degree >= threshold (ties at the cut-off are all kept)
nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)

plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm

# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]

# draw nodes
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=publishers_nodes, 
    node_color='skyblue', 
    node_size=800, 
    label='Publishers'
)
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=authors_nodes, 
    node_color='lightcoral', 
    node_size=200, 
    label='Authors'
)

# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)

plt.title('Network of Authors and Publishers (Top 25 nodes) between 1860-1940', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [195]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
    df_exploded['Year'].between(1940, 1960)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()

# build bipartite graph Publisher ↔ Author
G = nx.from_pandas_edgelist(
    df_network, 
    source='Publisher', 
    target='Authors', 
    create_using=nx.Graph()
)

# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
threshold = sorted(node_degrees.values(), reverse=True)[:25][-1] # degree of the 25th-ranked node (or of the last one if fewer)

nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)

plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm

# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]

# draw nodes
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=publishers_nodes, 
    node_color='skyblue', 
    node_size=800, 
    label='Publishers'
)
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=authors_nodes, 
    node_color='lightcoral', 
    node_size=200, 
    label='Authors'
)

# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)

plt.title('Network of Authors and Publishers (Top 25 nodes) between 1940-1960', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [196]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
    df_exploded['Year'].between(1960, 1980)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()

# build bipartite graph Publisher ↔ Author
G = nx.from_pandas_edgelist(
    df_network, 
    source='Publisher', 
    target='Authors', 
    create_using=nx.Graph()
)

# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
threshold = sorted(node_degrees.values(), reverse=True)[:25][-1] # degree of the 25th-ranked node (or of the last one if fewer)

nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)

plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm

# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]

# draw nodes
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=publishers_nodes, 
    node_color='skyblue', 
    node_size=800, 
    label='Publishers'
)
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=authors_nodes, 
    node_color='lightcoral', 
    node_size=200, 
    label='Authors'
)

# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)

plt.title('Network of Authors and Publishers (Top 25 nodes) between 1960-1980', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [197]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
    df_exploded['Year'].between(1980, 2000)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()

# build bipartite graph Publisher ↔ Author
G = nx.from_pandas_edgelist(
    df_network, 
    source='Publisher', 
    target='Authors', 
    create_using=nx.Graph()
)

# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
threshold = sorted(node_degrees.values(), reverse=True)[:25][-1] # degree of the 25th-ranked node (or of the last one if fewer)

nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)

plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm

# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]

# draw nodes
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=publishers_nodes, 
    node_color='skyblue', 
    node_size=800, 
    label='Publishers'
)
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=authors_nodes, 
    node_color='lightcoral', 
    node_size=200, 
    label='Authors'
)

# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)

plt.title('Network of Authors and Publishers (Top 25 nodes) between 1980-2000', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [198]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
    df_exploded['Year'].between(2000, 2010)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()

# build bipartite graph Publisher ↔ Author
G = nx.from_pandas_edgelist(
    df_network, 
    source='Publisher', 
    target='Authors', 
    create_using=nx.Graph()
)

# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
threshold = sorted(node_degrees.values(), reverse=True)[:25][-1] # degree of the 25th-ranked node (or of the last one if fewer)

nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)

plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm

# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]

# draw nodes
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=publishers_nodes, 
    node_color='skyblue', 
    node_size=800, 
    label='Publishers'
)
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=authors_nodes, 
    node_color='lightcoral', 
    node_size=200, 
    label='Authors'
)

# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)

plt.title('Network of Authors and Publishers (Top 25 nodes) between 2000-2010', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
In [199]:
# prepare data for network graph
df_exploded = df.explode('Authors')
df_network = df_exploded[
    df_exploded['Year'].between(2010, 2020)
][['Year', 'Publisher', 'Authors']].dropna(subset=['Authors']).copy()

# build bipartite graph Publisher ↔ Author
G = nx.from_pandas_edgelist(
    df_network, 
    source='Publisher', 
    target='Authors', 
    create_using=nx.Graph()
)

# graph visualization focusing on top nodes
node_degrees = dict(G.degree())
threshold = sorted(node_degrees.values(), reverse=True)[:25][-1] # degree of the 25th-ranked node (or of the last one if fewer)

nodes_to_draw = [node for node, degree in node_degrees.items() if degree >= threshold]
subgraph = G.subgraph(nodes_to_draw)

plt.figure(figsize=(15, 10))
pos = nx.spring_layout(subgraph, k=0.8, iterations=20) # layout algorithm

# split nodes by type for coloring
publishers_nodes = [n for n in subgraph.nodes() if n in df_network['Publisher'].unique()]
authors_nodes = [n for n in subgraph.nodes() if n in df_network['Authors'].explode().unique()]

# draw nodes
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=publishers_nodes, 
    node_color='skyblue', 
    node_size=800, 
    label='Publishers'
)
nx.draw_networkx_nodes(
    subgraph, pos, 
    nodelist=authors_nodes, 
    node_color='lightcoral', 
    node_size=200, 
    label='Authors'
)

# draw edges and labels
nx.draw_networkx_edges(subgraph, pos, alpha=0.5)
nx.draw_networkx_labels(subgraph, pos, font_size=12)

plt.title('Network of Authors and Publishers (Top 25 nodes) between 2010-2020', fontsize=16, weight='bold')
plt.legend()
plt.tight_layout()
plt.axis('off')
plt.show()
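The period cells above repeat the same filtering and top-node selection with only the year range changing. As a minimal sketch, that shared logic could be factored into two small functions. The names `top_degree_subgraph` and `network_for_period` and the toy data are hypothetical, introduced here for illustration only; the column names `Year`, `Publisher` and `Authors` are taken from the cells above.

```python
import networkx as nx
import pandas as pd


def top_degree_subgraph(df_network, top_n=25):
    """Build the Publisher-Author graph and keep its top_n highest-degree nodes."""
    G = nx.from_pandas_edgelist(df_network, source="Publisher", target="Authors")
    if G.number_of_nodes() == 0:
        return G  # nothing to rank in an empty period
    degrees_desc = sorted((d for _, d in G.degree()), reverse=True)
    # degree of the top_n-th node; nodes tied at the cutoff are kept as well
    threshold = degrees_desc[min(top_n, len(degrees_desc)) - 1]
    keep = [n for n, d in G.degree() if d >= threshold]
    return G.subgraph(keep)


def network_for_period(df, start, end, top_n=25):
    """Filter exploded rows to start..end (both inclusive) and return the subgraph."""
    exploded = df.explode("Authors")
    in_period = exploded[exploded["Year"].between(start, end)]
    edges = in_period[["Publisher", "Authors"]].dropna(subset=["Authors"])
    return top_degree_subgraph(edges, top_n)


# toy data, invented purely for illustration
toy = pd.DataFrame({
    "Year": [1950, 1951, 1955, 1990],
    "Publisher": ["P1", "P1", "P2", "P3"],
    "Authors": [["A", "B"], ["C"], ["A"], ["D"]],
})
sub = network_for_period(toy, 1940, 1960, top_n=2)
print(sorted(sub.nodes()))
```

Note that `Series.between` includes both endpoints by default, so with ranges such as 1860-1940 and 1940-1960 a book from 1940 appears in both graphs. Whether that overlap is intended is a choice left to the analysis.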

Graph with Plotly Express: Animated Bar Chart of Subject Distribution by Publisher over Decades¶

In [200]:
fig = px.bar(
    df.groupby(
        [
            'YearInterval',
            df['School Subject'].str.split().str.slice(0, 3).str.join(' '),  # cut subject to first 3 words
            'Publisher'
        ],
        observed=False  # explicit current default; avoids the pandas FutureWarning for categorical keys
    ).size().reset_index(name='Books Count'),
    x="School Subject",
    y="Books Count",
    color="Publisher",
    animation_frame="YearInterval", # decade animation
    title="Subject Distribution by Publisher over Decades",
    category_orders={"YearInterval": sorted(df['YearInterval'].unique())}
)
# rotate x-axis labels to avoid overlap
fig.update_xaxes(
    tickangle=45,
    automargin=True
)

# move menu and slider down to avoid overlap with x-axis labels
fig.update_layout(
    margin=dict(b=100),  # enlarge bottom margin
    updatemenus=[{
        "type": "buttons",
        "showactive": True,
        "x": -0.05,      # move buttons to the left
        "y": -0.35,      # move buttons down
        "xanchor": "left",
        "yanchor": "top"
    }],
    sliders=[{
        "x": 0.1,        # move slider to the right
        "y": -0.55,      # move slider down
        "xanchor": "left",
        "yanchor": "top"
    }]
)

fig.show(renderer='notebook')
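The `observed` argument of `groupby` only matters when a grouping key is categorical. The sketch below uses toy data and assumes that `YearInterval` is such a column (for instance one built with `pd.cut`), which the cell above does not confirm. It shows that `observed=False` keeps categories with no rows, while `observed=True` drops them.

```python
import pandas as pd

# toy data, invented for illustration; "1970s" is a declared category with no rows
toy = pd.DataFrame({
    "YearInterval": pd.Categorical(
        ["1950s", "1950s", "1960s"],
        categories=["1950s", "1960s", "1970s"],
        ordered=True,
    ),
    "Publisher": ["P1", "P2", "P1"],
})

all_cats = toy.groupby("YearInterval", observed=False).size()
seen_only = toy.groupby("YearInterval", observed=True).size()

print(all_cats)   # includes a zero count for the empty "1970s" category
print(seen_only)  # only categories that actually occur
```

For an animated chart, keeping empty categories means every frame exists even when a period has no books. Dropping them gives a smaller table.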

Histplot of books by Year and Publisher¶

In [201]:
# Histplot of books by Year and Publisher

plt.figure(figsize=(14, 8))

colors_palette = sns.color_palette("Paired", 12) + sns.color_palette("Dark2", 13)

#sns.set(style="ticks", palette="colors_palette")

ax = sns.histplot(df_exploded[df_exploded['Publisher'].isin(df_exploded['Publisher'].value_counts().head(25).index)],
    x='Year',
    hue='Publisher',
    multiple='stack',
    palette=colors_palette,
    bins=40,
    legend=True
)

# Customize the legend created by seaborn
leg = ax.get_legend()
if leg:
    leg.set_title("Publisher")
    leg._loc = 2  # 2 = 'upper left'; private attribute, may change across matplotlib versions
    leg.set_bbox_to_anchor((1.05, 1))
    leg.set_frame_on(False)
    for t in leg.texts:
        t.set_fontsize(9)
else:
    print("⚠️ Legend not found.")

plt.title('Distribution of Books by Year and Publisher (Top 25 Publishers)')
plt.xlabel('Year of Publication')
plt.ylabel('Book Count')
sns.despine()
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
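The histogram above is drawn from `df_exploded`, which holds one row per book and author, so a book with several authors contributes several rows to the count. A minimal sketch of the effect on toy data follows, with one possible way to get back to one row per book. It assumes the original index is unique, since `explode` repeats the index label of the source row; the variable names are hypothetical.

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    "Year": [1950, 1951],
    "Publisher": ["P1", "P1"],
    "Authors": [["A", "B", "C"], ["D"]],
})

exploded = toy.explode("Authors")
rows_per_publisher = exploded["Publisher"].value_counts()

# keep the first row for each original index label: one row per book again
books = exploded[~exploded.index.duplicated(keep="first")]
books_per_publisher = books["Publisher"].value_counts()

print(rows_per_publisher["P1"])   # author-book rows
print(books_per_publisher["P1"])  # distinct books
```

If the y-axis is meant to show books rather than author-book pairs, plotting the deduplicated frame (or the unexploded `df`) would match the 'Book Count' label more closely.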